{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZfVtdwryvvP6"
   },
   "source": [
    "# Ungraded Lab: Training a binary classifier with the Sarcasm Dataset\n",
    "\n",
    "In this lab, you will revisit the [News Headlines Dataset for Sarcasm Detection](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection) from last week and proceed to build a train a model on it. The steps will be very similar to the previous lab with IMDB Reviews with just some minor modifications. You can tweak the hyperparameters and see how it affects the results. Let's begin!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PG_aRXpyx7f6"
   },
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "mGhogK1vx6eW"
   },
   "outputs": [],
   "source": [
    "import json\n",
    "import io\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import tensorflow as tf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "aWIM6gplHqfx"
   },
   "source": [
    "## Process the dataset\n",
    "\n",
    "You can download the dataset with the code below. Here it was already downloaded for you so the code in the next cell is commented out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "BQVuQrZNkPn9"
   },
   "outputs": [],
   "source": [
    "# Download the dataset\n",
    "# !wget https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset is saved as a JSON file. Load it into your workspace and put the sentences and labels into lists."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "oaLaaqhNkUPd"
   },
   "outputs": [],
   "source": [
    "# Load the JSON file\n",
    "with open(\"./sarcasm.json\", 'r') as f:\n",
    "    datastore = json.load(f)\n",
    "\n",
    "# Initialize the lists\n",
    "sentences = []\n",
    "labels = []\n",
    "\n",
    "# Collect sentences and labels into the lists\n",
    "for item in datastore:\n",
    "    sentences.append(item['headline'])\n",
    "    labels.append(item['is_sarcastic'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kw1I6oNSfCxa"
   },
   "source": [
    "## Parameters\n",
    "\n",
    "The parameters are placed in the cell below so you can easily tweak them later:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "wpF4x5olfHX-"
   },
   "outputs": [],
   "source": [
    "# Number of examples to use for training\n",
    "TRAINING_SIZE = 20000\n",
    "\n",
    "# Vocabulary size of the tokenizer\n",
    "VOCAB_SIZE = 10000\n",
    "\n",
    "# Maximum length of the padded sequences\n",
    "MAX_LENGTH = 32\n",
    "\n",
    "# Output dimensions of the Embedding layer\n",
    "EMBEDDING_DIM = 16"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dHibcDI0H5Zj"
   },
   "source": [
    "## Split the dataset\n",
    "\n",
    "Next, you will generate your train and test datasets. You will use the `training_size` value you set above to slice the `sentences` and `labels` lists into two sublists: one for training and another for testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "S1sD-7v0kYWk"
   },
   "outputs": [],
   "source": [
    "# Split the sentences\n",
    "train_sentences = sentences[0:TRAINING_SIZE]\n",
    "test_sentences = sentences[TRAINING_SIZE:]\n",
    "\n",
    "# Split the labels\n",
    "train_labels = labels[0:TRAINING_SIZE]\n",
    "test_labels = labels[TRAINING_SIZE:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qciTzNR7IHzJ"
   },
   "source": [
    "## Preprocessing the train and test sets\n",
    "\n",
    "As usual, you will generate a `TextVectorization` layer based on the training inputs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "fKriGi-pHCof"
   },
   "outputs": [],
   "source": [
    "# Instantiate the vectorization layer\n",
    "vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE, output_sequence_length=MAX_LENGTH)\n",
    "\n",
    "# Generate the vocabulary based on the training inputs\n",
    "vectorize_layer.adapt(train_sentences)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fThFbcdhzBcy"
   },
   "source": [
    "Unlike the previous lab (i.e. IMDB reviews), the data you're using here is not yet a `tf.data.Dataset` but a list. Thus, you can pass it directly to the `vectorize_layer` as shown below. As shown in the Week 1 labs, this will output post-padded sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "3GkcD_DIenKh"
   },
   "outputs": [],
   "source": [
    "# Apply the vectorization layer on the train and test inputs\n",
    "train_sequences = vectorize_layer(train_sentences)\n",
    "test_sequences = vectorize_layer(test_sentences)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8PgmPPhH1W4t"
   },
   "source": [
    "Now you will combine the inputs and labels into a `tf.data.Dataset` to prepare it for training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "iGrSrH2GSz1y"
   },
   "outputs": [],
   "source": [
    "# Combine input-output pairs for training\n",
    "train_dataset_vectorized = tf.data.Dataset.from_tensor_slices((train_sequences,train_labels))\n",
    "test_dataset_vectorized = tf.data.Dataset.from_tensor_slices((test_sequences,test_labels))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lfawUYiC1_AX"
   },
   "source": [
    "You can view a few examples as a sanity check."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "2JpSZ-D7IG_A"
   },
   "outputs": [],
   "source": [
    "# View 2 examples\n",
    "for example in train_dataset_vectorized.take(2):\n",
    "  print(example)\n",
    "  print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nfU1NwRB2s8k"
   },
   "source": [
    "Then, you will optimize and batch the datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "WY2CTOd1JnrB"
   },
   "outputs": [],
   "source": [
    "SHUFFLE_BUFFER_SIZE = 1000\n",
    "PREFETCH_BUFFER_SIZE = tf.data.AUTOTUNE\n",
    "BATCH_SIZE = 32\n",
    "\n",
    "# Optimize the datasets for training\n",
    "train_dataset_final = (train_dataset_vectorized\n",
    "                       .cache()\n",
    "                       .shuffle(SHUFFLE_BUFFER_SIZE)\n",
    "                       .prefetch(PREFETCH_BUFFER_SIZE)\n",
    "                       .batch(BATCH_SIZE)\n",
    "                       )\n",
    "\n",
    "test_dataset_final = (test_dataset_vectorized\n",
    "                      .cache()\n",
    "                      .prefetch(PREFETCH_BUFFER_SIZE)\n",
    "                      .batch(BATCH_SIZE)\n",
    "                      )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AMF4afx2IdHo"
   },
   "source": [
    "## Build and Compile the Model\n",
    "\n",
    "Next, you will build the model. The architecture is similar to the previous lab but you will use a [GlobalAveragePooling1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D) layer instead of `Flatten` after the Embedding. This adds the task of averaging over the sequence dimension before connecting to the dense layers. See a short demo of how this works using the snippet below. Notice that it gets the average over 3 arrays (i.e. `(10 + 1 + 1) / 3` and `(2 + 3 + 1) / 3` to arrive at the final output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "7KDCvSc0kFOz"
   },
   "outputs": [],
   "source": [
    "# Initialize a GlobalAveragePooling1D (GAP1D) layer\n",
    "gap1d_layer = tf.keras.layers.GlobalAveragePooling1D()\n",
    "\n",
    "# Define sample array\n",
    "sample_array = np.array([[[10,2],[1,3],[1,1]]])\n",
    "\n",
    "# Print shape and contents of sample array\n",
    "print(f'shape of sample_array = {sample_array.shape}')\n",
    "print(f'sample array: {sample_array}')\n",
    "\n",
    "# Pass the sample array to the GAP1D layer\n",
    "output = gap1d_layer(sample_array)\n",
    "\n",
    "# Print shape and contents of the GAP1D output array\n",
    "print(f'output shape of gap1d_layer: {output.shape}')\n",
    "print(f'output array of gap1d_layer: {output.numpy()}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "evlU_kqOshc4"
   },
   "source": [
    "This added computation reduces the dimensionality of the model as compared to using `Flatten()` and thus, the number of training parameters will also decrease. See the output of `model.summary()` below and see how it compares if you swap out the pooling layer with a simple `Flatten()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "FufaT4vlkiDE"
   },
   "outputs": [],
   "source": [
    "# Build the model\n",
    "model = tf.keras.Sequential([\n",
    "    tf.keras.Input(shape=(MAX_LENGTH,)),\n",
    "    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),\n",
    "    tf.keras.layers.GlobalAveragePooling1D(),\n",
    "    tf.keras.layers.Dense(24, activation='relu'),\n",
    "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
    "])\n",
    "\n",
    "# Print the model summary\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GMxT5NzKtRgr"
   },
   "source": [
    "You will use the same loss, optimizer, and metrics from the previous lab."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "XfDt1hmYkiys"
   },
   "outputs": [],
   "source": [
    "# Compile the model\n",
    "model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Axtd-WQAJIUK"
   },
   "source": [
    "## Train the Model\n",
    "\n",
    "Now you will feed in the prepared datasets to train the model. If you used the default hyperparameters, you will get around 99% training accuracy and 80% validation accuracy.\n",
    "\n",
    "*Tip: You can set the `verbose` parameter of `model.fit()` to `2` to indicate that you want to print just the results per epoch. Setting it to `1` (default) displays a progress bar per epoch, while `0` silences all displays. It doesn't matter much in this Colab but when working in a production environment, you may want to set this to `2` as recommended in the [documentation](https://keras.io/api/models/model_training_apis/#fit-method).*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "2DTKQFf1kkyc"
   },
   "outputs": [],
   "source": [
    "num_epochs = 10\n",
    "\n",
    "# Train the model\n",
    "history = model.fit(train_dataset_final, epochs=num_epochs, validation_data=test_dataset_final, verbose=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "L_bWhGOSJLLm"
   },
   "source": [
    "## Visualize the Results\n",
    "\n",
    "You can use the cell below to plot the training results. You may notice some overfitting because your validation accuracy is slowly dropping while the training accuracy is still going up. See if you can improve it by tweaking the hyperparameters. Some example values are shown in the lectures."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "2HYfBKXjkmU8"
   },
   "outputs": [],
   "source": [
    "# Plot utility\n",
    "def plot_graphs(history, string):\n",
    "  plt.plot(history.history[string])\n",
    "  plt.plot(history.history['val_'+string])\n",
    "  plt.xlabel(\"Epochs\")\n",
    "  plt.ylabel(string)\n",
    "  plt.legend([string, 'val_'+string])\n",
    "  plt.show()\n",
    "\n",
    "# Plot the accuracy and loss\n",
    "plot_graphs(history, \"accuracy\")\n",
    "plot_graphs(history, \"loss\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JN6kaxxcJQgd"
   },
   "source": [
    "## Visualize Word Embeddings\n",
    "\n",
    "As before, you can visualize the final weights of the embeddings using the [Tensorflow Embedding Projector](https://projector.tensorflow.org/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "c9MqihtEkzQ9"
   },
   "outputs": [],
   "source": [
    "# Get the embedding layer from the model (i.e. first layer)\n",
    "embedding_layer = model.layers[0]\n",
    "\n",
    "# Get the weights of the embedding layer\n",
    "embedding_weights = embedding_layer.get_weights()[0]\n",
    "\n",
    "# Print the shape. Expected is (vocab_size, embedding_dim)\n",
    "print(embedding_weights.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "LoBXVffknldU"
   },
   "outputs": [],
   "source": [
    "# Open writeable files\n",
    "out_v = io.open('vecs.tsv', 'w', encoding='utf-8')\n",
    "out_m = io.open('meta.tsv', 'w', encoding='utf-8')\n",
    "\n",
    "# Get the word list\n",
    "vocabulary = vectorize_layer.get_vocabulary()\n",
    "\n",
    "# Initialize the loop. Start counting at `1` because `0` is just for the padding\n",
    "for word_num in range(1, len(vocabulary)):\n",
    "\n",
    "  # Get the word associated with the current index\n",
    "  word_name = vocabulary[word_num]\n",
    "\n",
    "  # Get the embedding weights associated with the current index\n",
    "  word_embedding = embedding_weights[word_num]\n",
    "\n",
    "  # Write the word name\n",
    "  out_m.write(word_name + \"\\n\")\n",
    "\n",
    "  # Write the word embedding\n",
    "  out_v.write('\\t'.join([str(x) for x in word_embedding]) + \"\\n\")\n",
    "\n",
    "# Close the files\n",
    "out_v.close()\n",
    "out_m.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1GierJvdJWMt"
   },
   "source": [
    "## Wrap Up\n",
    "\n",
    "In this lab, you were able to build a binary classifier to detect sarcasm. You saw some overfitting in the initial attempt and hopefully, you were able to arrive at a better set of hyperparameters.\n",
    "\n",
    "So far, you've been tokenizing datasets from scratch and you're treating the vocab size as a hyperparameter. Furthermore, you're tokenizing the texts by building a vocabulary of full words. In the next lab, you will make use of a pre-tokenized dataset that uses a vocabulary of *subwords*. For instance, instead of having a unique token for the word `Tensorflow`, it will instead have a token each for `Ten`, `sor`, and `flow`. You will see the motivation and implications of having this design in the next exercise. See you there!"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "private_outputs": true,
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
